Abstract

We consider a bandit problem which involves sequential sampling from two populations
(arms). Each arm produces a noisy reward realization which depends on an observable 
random covariate. The goal is to maximize cumulative expected reward. We derive general 
lower bounds on the performance of any admissible policy, and develop an algorithm whose 
performance achieves the order of said lower bound up to logarithmic terms. This is done 
by decomposing the global problem into suitably "localized" bandit problems. Proofs blend 
ideas from nonparametric statistics and traditional methods used in the bandit literature.

1 Introduction

The seminal paper of Robbins (1952) introduced an important class of sequential optimiza- 
tion problems, otherwise known as multi-armed bandits. These models have since been used 
extensively in such fields as statistics, operations research, engineering, computer science and 
economics. The traditional two-armed bandit problem can be described as follows. Consider two 
statistical populations (arms), where at each point in time it is possible to sample from only one 
of the two and receive a random reward dictated by the properties of the sampled population. 
The objective is to devise a sampling policy that maximizes expected cumulative (or discounted) 
rewards over a finite (or infinite) time horizon. The difference between the performance of said 
sampling policy and that of an oracle, that repeatedly samples from the population with the 
higher mean reward, is called the regret. Thus, one can re-phrase the objective as minimizing 
the regret. 

The original motivation for bandit-type problems originates from treatment allocation in 
clinical trials; see, e.g., Lai and Robbins (1985) for further discussion and references therein. 
Here patients enter sequentially and receive one of several treatments. The efficacy of each 
treatment is unknown, and for each patient a noisy measurement of it is recorded. The goal is 
to assign as many patients as possible to the best treatment. An example of more recent work 
can be found in the area of web-based advertising, and more generally customized marketing.
An on-line publisher needs to choose one of several ads to present to consumers, where the
efficacy of these ads is unknown. The publisher observes click-through-rates (CTRs) for each 
ad, which provide a noisy measurement of the efficacy, and based on that needs to assign ads 
that maximize CTR. 

When the populations being sampled are homogeneous, i.e., when the sequential rewards are
independent and identically distributed (iid) in each arm, Lai and Robbins (1985) proposed a
family of policies that at each step compute the empirical mean reward in each arm, and add to
that a confidence bound that accounts for uncertainty in these estimates. These so-called
upper-confidence-bound (UCB) policies were shown to be asymptotically optimal. In particular, it is
proven in Lai and Robbins (1985) that such a policy incurs a regret of order $\log n$, where $n$ is the
length of the time horizon, and no other "good" policy can (asymptotically) achieve a smaller
regret; see also Auer et al. (2002). The elegance of the theory and sharp results developed in
Lai and Robbins (1985) hinge to a large extent on the assumption of homogeneous populations
and hence identically distributed rewards. This, however, is clearly too restrictive for many
applications of interest. Often, the decision maker observes further information and based on 
that a more customized allocation can be made. In such settings rewards may still be assumed 
to be independent, but no longer identically distributed in each arm. A particular way to encode 
this is to allow for an exogenous variable (a covariate) that affects the rewards generated by 
each arm at each point in time when this arm is pulled. 

Such a formulation was first introduced in Woodroofe (1979) under parametric assumptions 
and in a somewhat restricted setting; see Goldenshluger and Zeevi (2009) and Wang et al. (2005) 
for two very different recent approaches to the study of such bandit problems, as well as references 
therein for further links to antecedent literature. The first work to venture outside the realm 
of parametric modeling assumptions was that of Yang and Zhu (2002). In particular, they 
assumed the mean response in each arm, conditional on the covariate value, follows a general 
functional form, hence one can view their setting as a nonparametric bandit problem. They
proposed a policy that is based on estimating each response function, and then, rather than
greedily choosing the arm with the highest estimated mean response given the covariate, allows,
with some small probability, the selection of a potentially inferior arm. (This is a variant of
$\varepsilon$-greedy policies; see Auer et al. (2002).) If the nonparametric estimators of the arms' functional response
are consistent, and the randomization is chosen in a suitable manner, then the above policies 
ensure that the average regret tends to zero as the time horizon n grows to infinity. In the typical 
bandit terminology, such policies are said to be consistent. However, it is unclear whether they 
satisfy a more refined notion of optimality, insofar as the magnitude of the regret is concerned, 
as is the case for UCB-type policies in traditional bandit problems. Moreover, the study by 
Yang and Zhu (2002) does not spell out the connection between the characteristics of the class 
of response functions, and the resulting complexity of the nonparametric bandit problem. 

The purpose of the present paper is to further understanding of nonparametric bandit prob- 
lems, deriving regret-optimal policies and shedding light on some of the elements that dictate 
the complexity of such problems. We make only two assumptions on the underlying functional 
form that governs the arms' responses. The first is a mild smoothness condition. Smoothness 
assumptions can be exploited using "plug-in" policies as opposed to "minimum contrast" policies; a
detailed account of the differences and similarities between these two setups in the full information
case can be found in Audibert and Tsybakov (2007). Minimum contrast type policies have
already received some attention in the bandit literature with side information, also known as contextual
bandits, in the papers of Langford and Zhang (2008) and also Kakade et al. (2008). In these
studies, admissible policies are restricted to a more limited set than the general class of
non-anticipating policies. A related problem, online convex optimization with side information, was
studied by Hazan and Megiddo (2007), where the authors use a discretization technique similar
to the one employed in this paper. It is worth noting that the cumulative regret in these papers
is defined in a weaker form compared to the traditional bandit literature, since the cumulative
reward of a proposed policy is compared to that of the best policy in a certain restricted class
of policies. Therefore, bounds on the regret depend, among other things, on the complexity of
said class of policies. Plug-in type policies have received attention in the context of the continuum
armed bandit problem, where, as the name suggests, there are uncountably many arms.
Notable entries in that stream of work are Slivkins (2009) and Lu et al. (2009), who impose a 
smoothness condition both on the space of arms and the space of covariates, obtaining optimal 
regret bounds up to logarithmic terms. 

The second key assumption in our paper is a so-called margin condition, as it has come to
be known in the full information setup; cf. Tsybakov (2004). In that setting, it has been shown
to critically affect the complexity of classification problems (Tsybakov, 2004; Boucheron et al.,
2005; Audibert and Tsybakov, 2007). In the bandit setup, this condition encodes the "separation"
between the functions that describe the arms' responses and was originally studied
by Goldenshluger and Zeevi (2009) in the one armed bandit problem; see further discussion in 
section 2. We will see later that the margin condition is a natural measure of complexity in the 
nonparametric bandit problem. 

In this paper, we introduce a family of policies called UCBograms. The term is indicative 
of two salient ingredients of said policies: they build on regressogram estimators; and augment 
the resulting mean response estimates with upper-confidence-bound terms. The idea of the 
regressogram is quite natural and easy to implement. It groups the covariate vectors into 
bins and then estimates, by means of simple averaging, a constant which is a proxy for the 
mean response of each arm over each such bin. One then views these bins as indexing "local" 
bandit problems, which are solved by applying a suitable UCB-type modification, following 
the logic of Lai and Robbins (1985) and Auer et al. (2002). In other words, this family of 
policies decomposes the non-parametric bandit problem into a sequence of localized standard 
bandit problems; see section 3 for a complete description. The idea of binning covariates lends 
itself to natural implementation in the two motivating examples described earlier: patients and 
consumers are segmented into groups with "similar" characteristics; and then the treatment or 
ad is allocated based on the characteristic response over that group. 

In terms of performance, we prove that the UCBogram policies achieve a regret that is fairly 
large compared to typical orders of regret observed in the literature. In particular, as opposed 
to a bounded or logarithmic growth, in our setting the order of the regret is polynomial in the 
time horizon n; see Theorem 3.1. One may question, especially given the simple structure and 
logic underlying the UCBogram policy, whether this is the best that can be achieved in such 
problems. To that end, we prove a lower bound which demonstrates that for any admissible 
policy there exist arm response functions satisfying our assumptions for which one cannot im- 
prove on the polynomial order of the upper bound established in Theorem 3.1; see Theorem 
4.1. Finally, beyond these analytical results, in our view one of the contributions of the present
paper is in pointing to some possible synergies and potentially interesting connections between
the traditional bandit literature and nonparametric statistics. 

2 Description of the problem 

2.1 Machine and game 

A bandit machine with covariates is characterized by a sequence

$$(X_t, Y_t^{(1)}, Y_t^{(2)}), \quad t = 1, 2, \ldots$$

of independent random vectors, where $(X_t)$, $t = 1, 2, \ldots$ is a sequence of iid covariates in
$\mathcal{X} \subset \mathbb{R}^d$ with probability distribution $P_X$, and $Y_t^{(i)}$ denotes the random reward yielded by
arm $i$ at time $t$. We assume that, for each $i = 1, 2$, conditionally on $\{X_t = x\}$, the rewards
$Y_t^{(i)}$, $t = 1, \ldots, n$ are iid random variables in $[0, 1]$ with conditional expectation given by

$$\mathbb{E}[Y_t^{(i)} \mid X_t] = f^{(i)}(X_t), \quad t = 1, 2, \ldots, \quad i = 1, 2,$$

where $f^{(i)}$, $i = 1, 2$, are unknown functions such that $0 \le f^{(i)}(x) \le 1$, for any $i = 1, 2$, $x \in \mathcal{X}$.
A natural example arises when $Y_t^{(i)}$ takes values in $\{0, 1\}$ so that the conditional distribution of
$Y_t^{(i)}$ given $X_t$ is Bernoulli with parameter $f^{(i)}(X_t)$.

The game takes place sequentially on this machine, pulling one of the two arms at each
time $t = 1, \ldots, n$. A non-anticipating policy $\pi = \{\pi_t\}$ is a sequence of random functions
$\pi_t : \mathcal{X} \to \{1, 2\}$ indicating to the operator which arm to pull at each time $t$, and such that $\pi_t$
depends only on observations strictly anterior to $t$. The oracle rule $\pi^\star$ refers to the strategy
that would be played by an omniscient operator with complete knowledge of the functions
$f^{(i)}$, $i = 1, 2$. Given side information $X_t$, the oracle policy $\pi^\star$ prescribes the arm with the largest
expected reward, i.e.,

$$\pi^\star(X_t) := \arg\max_{j = 1, 2} f^{(j)}(X_t).$$

The oracle rule will be used to benchmark any proposed policy $\pi$ and to measure the performance
of the latter via its (expected cumulative) regret at time $n$ defined by

$$R_n(\pi) := \mathbb{E} \sum_{t=1}^n \big( Y_t^{(\pi^\star(X_t))} - Y_t^{(\pi_t(X_t))} \big) = \mathbb{E} \sum_{t=1}^n \big( f^{(\pi^\star(X_t))}(X_t) - f^{(\pi_t(X_t))}(X_t) \big).$$

Also, let $S_n(\pi)$ denote the inferior sampling rate at time $n$ defined by

$$S_n(\pi) := \mathbb{E} \sum_{t=1}^n \mathbb{1}\big( \pi_t(X_t) \ne \pi^\star(X_t),\ f^{(1)}(X_t) \ne f^{(2)}(X_t) \big), \qquad (1)$$

where $\mathbb{1}(A)$ is the indicator function that takes value 1 if event $A$ is realized and 0 otherwise.
The quantity $S_n(\pi)$ measures the expected number of times at which a strictly suboptimal arm
has been pulled; note that in our setting the suboptimal arm varies as a function of the
covariate value $x$.

Without further assumptions on the machine, the game can be arbitrarily difficult and, as
a result, the regret and inferior sampling rate can be arbitrarily close to $n$. In the following
subsection, we describe natural assumptions on the regularity of the machine that allow us to
control its complexity.
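To make the two performance measures concrete, the following minimal sketch accumulates the summands of the regret and of the inferior sampling rate along one simulated path. The mean reward functions `f1` and `f2`, the uniform covariates on $[0, 1]$ and the two toy policies are hypothetical choices made for illustration only; the policies here ignore past rewards, which is a simplification of the general non-anticipating case.

```python
import random

def f1(x):
    # Hypothetical mean reward of arm 1 (illustration only).
    return 0.5 + 0.3 * (x - 0.5)

def f2(x):
    # Hypothetical mean reward of arm 2 (illustration only).
    return 0.5

def oracle(x):
    """Arm with the larger mean reward at covariate x (ties go to arm 1)."""
    return 1 if f1(x) >= f2(x) else 2

def regret_and_inferior_pulls(policy, n, seed=0):
    """Sum the regret and inferior-sampling summands over one simulated path."""
    rng = random.Random(seed)
    means = {1: f1, 2: f2}
    regret, inferior = 0.0, 0
    for t in range(1, n + 1):
        x = rng.random()                      # X_t uniform on [0, 1]
        arm, best = policy(t, x), oracle(x)
        regret += means[best](x) - means[arm](x)
        if arm != best and f1(x) != f2(x):    # strictly suboptimal pull
            inferior += 1
    return regret, inferior

def always_arm_2(t, x):
    return 2

def oracle_policy(t, x):
    return oracle(x)

print(regret_and_inferior_pulls(oracle_policy, 1000))
print(regret_and_inferior_pulls(always_arm_2, 1000))
```

Averaging these path-wise quantities over many seeds would approximate $R_n(\pi)$ and $S_n(\pi)$ for the chosen policy.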

2.2 Smoothness and margin conditions 

As usual in nonparametric estimation we first impose some regularity on the functions
$f^{(i)}$, $i = 1, 2$. Here and in what follows we use $\|\cdot\|$ to denote the Euclidean norm.

Smoothness condition. We say that the machine satisfies the smoothness condition with
parameters $(\beta, L)$ if

$$|f^{(i)}(x) - f^{(i)}(x')| \le L \|x - x'\|^\beta, \quad \forall x, x' \in \mathcal{X},\ i = 1, 2 \qquad (2)$$

for some $\beta \in (0, 1]$ and $L > 0$.

Notice that a direct consequence of the smoothness condition with parameters $(\beta, L)$ is
that the function $\Delta := |f^{(1)} - f^{(2)}|$ also satisfies the smoothness condition with parameters
$(\beta, 2L)$. The behavior of the function $\Delta$ critically controls the complexity of the problem and the
smoothness condition gives a local upper bound on this function. The second condition imposed
gives a lower bound on this function though in a weaker global sense. It is closely related to the
margin condition employed in classification (Tsybakov, 2004; Mammen and Tsybakov, 1999),
which drives the terminology employed here.

Margin condition. We say that the machine satisfies the margin condition with parameter $\alpha$
if there exist $\delta_0 \in (0, 1)$, $C_\delta > 0$ such that

$$P_X\big[ 0 < |f^{(1)}(X) - f^{(2)}(X)| \le \delta \big] \le C_\delta \delta^\alpha, \quad \forall \delta \in [0, \delta_0]$$

for some $\alpha > 0$.

In what follows, we will focus our attention on marginals $P_X$ that are equivalent to the
Lebesgue measure on a compact subset of $\mathbb{R}^d$. In that way, the margin condition will only
contain information about the behavior of the function $\Delta$ and not the marginal $P_X$ itself. A
large value of the parameter $\alpha$ means that the function $\Delta$ either takes value 0 or is bounded
away from 0, except over a set of small $P_X$-probability. Conversely, for values of $\alpha$ close to 0,
the margin condition is essentially void and the two functions can be arbitrarily close, making it
difficult to distinguish between them. This will be reflected in the bounds on the regret which
are derived in the subsequent section.
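As a quick sanity check of what the margin condition measures, the sketch below estimates $P_X[0 < \Delta(X) \le \delta]$ by Monte Carlo for a hypothetical one-dimensional gap function $\Delta(x) = |x - 1/2|^{1/\alpha}$ with $X$ uniform on $[0, 1]$. For this particular choice (an assumption of the example, not taken from the paper) the probability equals $2\delta^\alpha$ whenever $\delta^\alpha \le 1/2$, so the margin condition holds with $C_\delta = 2$.

```python
import random

def margin_probability(gap, delta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P[0 < gap(X) <= delta] for X uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if 0.0 < gap(rng.random()) <= delta)
    return hits / n_samples

alpha = 0.5  # hypothetical margin parameter for this illustration

def gap(x):
    # Hypothetical gap function Delta(x) = |x - 1/2|^(1/alpha).
    return abs(x - 0.5) ** (1.0 / alpha)

estimates = {delta: margin_probability(gap, delta) for delta in (0.01, 0.05, 0.1)}
for delta, est in estimates.items():
    # Compare the estimate with the closed form 2 * delta^alpha.
    print(delta, est, 2 * delta ** alpha)
```

Smaller values of `alpha` make the gap function flatter near its zero and the estimated probabilities larger, which is the regime in which the two arms are hard to tell apart.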

Intuitively, the smoothness condition and the margin condition work in opposite directions.
Indeed, the former ensures that the function $\Delta$ does not depart from zero too fast whereas the
latter warrants the opposite. The following proposition accurately quantifies the extent to which
the conditions are conflicting.

Proposition 2.1 Under the smoothness condition with parameters $(\beta, L)$, any machine that
satisfies the margin condition with parameter $\alpha$ such that $\alpha\beta > 1$ exhibits an oracle policy $\pi^\star$
which dictates pulling only one of the two arms all the time, $P_X$-almost surely. Conversely, if
$\alpha\beta \le 1$ there exist machines with nontrivial oracle policies.

Proof. The first part of the proof is a straightforward consequence of Proposition 3.4 in
Audibert and Tsybakov (2007). To prove the second part, consider the following example.
Assume that $d = 1$, $\mathcal{X} = [0, 2]$, $f^{(2)} \equiv 0$ and $f^{(1)}(x) = L\,\mathrm{sign}(x - 1)|x - 1|^{1/\alpha}$. Notice that
$f^{(1)}$ satisfies the smoothness condition with parameters $(\beta, L)$ if and only if $\alpha\beta \le 1$. The oracle
policy is not trivial and defined by $\pi^\star(x) = 2$ if $x < 1$ and $\pi^\star(x) = 1$ if $x \ge 1$. Moreover, it
can be easily shown that the machine satisfies the margin condition with parameter $\alpha$ and with
$\delta_0 = C_\delta = 1$. ■



3 Policy and main result 

We first outline a policy to operate the bandit machine described in the previous section. Then 
we state the main result which is an upper bound on the regret for this policy. Finally, we state 
a proposition which allows us to translate the bound on the regret into a bound on the inferior 
sampling rate. 

3.1 Binning and regressograms 

To design a policy that solves the bandit problem described in the previous section, one
inevitably has to find an estimate of the functions $f^{(i)}$, $i = 1, 2$ at the current point $X_t$. There
exists a wide variety of nonparametric regression estimators ranging from local polynomials to
wavelet estimators. However, a very simple piecewise constant estimator, commonly referred to
as the regressogram, will be particularly suitable for our purposes.

Assume now that $\mathcal{X} = [0, 1]^d$ and let $\{B_j, j = 1, \ldots, M^d\}$ be the regular partition of $\mathcal{X}$, i.e.,
the reindexed collection of hypercubes defined for $k = (k_1, \ldots, k_d) \in \{1, \ldots, M\}^d$,

$$B_k = \Big\{ x \in \mathcal{X} : \frac{k_i - 1}{M} \le x_i \le \frac{k_i}{M},\ i = 1, \ldots, d \Big\}.$$

For each arm $i = 1, 2$, consider the average reward for each bin $B_j$, $j = 1, \ldots, M^d$ defined by

$$\bar f_j^{(i)} = \frac{1}{p_j} \int_{B_j} f^{(i)}(x)\, \mathrm{d}P_X(x),$$

where $p_j = P_X(B_j)$. By analogy with histograms, the empirical counterpart of the piecewise
constant function $x \mapsto \sum_{j=1}^{M^d} \bar f_j^{(i)} \mathbb{1}(x \in B_j)$ is often called the regressogram. To define it, we need
the following quantities. Let $N_t^{(i)}(j, \pi)$ denote the number of times $\pi$ prescribed to pull arm $i$
at times anterior to $t$ when the covariate was in bin $B_j$,

$$N_t^{(i)}(j, \pi) = \sum_{s=1}^{t-1} \mathbb{1}(X_s \in B_j,\ \pi_s(X_s) = i),$$

and let $\bar Y_t^{(i)}(j, \pi)$ denote the average reward collected at those times,

$$\bar Y_t^{(i)}(j, \pi) = \frac{1}{N_t^{(i)}(j, \pi)} \sum_{s=1}^{t-1} Y_s^{(i)}\, \mathbb{1}(X_s \in B_j,\ \pi_s(X_s) = i),$$

where here and throughout this paper, we use the convention $1/0 = \infty$. For any arm $i = 1, 2$
and any time $t \ge 1$ the regressograms obtained from a policy $\pi$ at time $t$ are defined by the
following piecewise constant estimators

$$\hat f_t^{(i)}(x) = \sum_{j=1}^{M^d} \bar Y_t^{(i)}(j, \pi)\, \mathbb{1}(x \in B_j).$$

While regressograms are rather rudimentary nonparametric estimators of the functions $f^{(i)}$, they
allow us to decompose the original problem into a collection of $M^d$ traditional bandit machines
without covariates, each one corresponding to a different bin.
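The binning and averaging steps can be sketched in a few lines. The class name `Regressogram`, the zero-based bin indices and the choice to report `math.inf` for a bin and arm that has never been pulled are assumptions of this sketch (the last one is one way of reading the convention $1/0 = \infty$), not notation from the paper.

```python
import math
from collections import defaultdict

def bin_index(x, M):
    """Zero-based index in {0, ..., M-1}^d of the hypercube containing x in [0, 1]^d."""
    return tuple(min(int(xi * M), M - 1) for xi in x)

class Regressogram:
    """Running per-bin, per-arm averages of the observed rewards."""

    def __init__(self, M):
        self.M = M
        self.count = defaultdict(int)    # (bin, arm) -> number of pulls
        self.total = defaultdict(float)  # (bin, arm) -> sum of rewards

    def update(self, x, arm, reward):
        key = (bin_index(x, self.M), arm)
        self.count[key] += 1
        self.total[key] += reward

    def pulls(self, x, arm):
        """Number of past pulls of `arm` with a covariate in the bin of x."""
        return self.count[(bin_index(x, self.M), arm)]

    def estimate(self, x, arm):
        """Average reward of `arm` in the bin of x; +inf if never pulled there."""
        key = (bin_index(x, self.M), arm)
        n = self.count[key]
        return self.total[key] / n if n > 0 else math.inf

reg = Regressogram(M=4)
reg.update((0.1,), 1, 1.0)
reg.update((0.2,), 1, 0.0)
print(reg.estimate((0.05,), 1), reg.pulls((0.05,), 1))
```

Because each bin keeps its own counters, every bin can be treated as a separate two-armed machine, which is the decomposition used in the sequel.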

3.2 The UCBogram

The "UCBogram" is an index type policy based on upper confidence bounds for the regressogram
defined above. Upper confidence bound (UCB) policies are known to perform optimally in
the traditional two armed bandit problem, i.e., without covariates (Lai and Robbins, 1985;
Auer et al., 2002). The index of each arm is computed as the sum of the average past reward
and a stochastic term accounting for the deviations of the observed average reward from the
true average reward. In the UCBogram, the average reward is simply replaced by the value of
the regressogram at the current covariate $X_t$.

For any $s \ge 1$ the upper confidence bound at time $t$ is of the form

$$U_t(s) := \sqrt{\frac{2 \log t}{s}}.$$

The UCBogram $\hat\pi$ is defined as follows. For any $x \in [0, 1]^d$, define

$$N_t^{(i)}(x) = \sum_{j=1}^{M^d} N_t^{(i)}(j, \hat\pi)\, \mathbb{1}(x \in B_j),$$

the number of times the UCBogram prescribed to pull arm $i$ at times anterior to $t$ when the
covariate was in the same bin as $x$. Then $\hat\pi = (\hat\pi_1, \hat\pi_2, \ldots)$ is defined recursively by

$$\hat\pi_t(x) = \arg\max_{i = 1, 2} \Big\{ \hat f_t^{(i)}(x) + U_t\big(N_t^{(i)}(x)\big) \Big\}.$$

Notice that the UCBogram is indeed a UCB-type policy: for each arm $i = 1, 2$ and
at each point $x$, it computes an estimator $\hat f_t^{(i)}(x)$ of the expected reward and adds an upper
confidence bound $U_t(N_t^{(i)}(x))$ to account for stochastic variability in this estimator. The most
attractive feature of the regressogram is that it allows us to decompose the nonparametric bandit
problem into independently operated local machines as detailed in the proof of the following
theorem.
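The policy described above can be sketched as a short simulation. Everything about the environment here is an assumption made for illustration: covariates are drawn uniformly on $[0, 1]^d$, rewards are Bernoulli with the given means, the mean function `f1` is a hypothetical Lipschitz example, and an arm never pulled in the current bin receives an infinite index so that it is tried first.

```python
import math
import random
from collections import defaultdict

def ucbogram(reward_fns, n, M, d=1, seed=0):
    """Simulate a UCBogram-style policy for n rounds; return the (x, arm) decisions.

    reward_fns maps arm -> mean reward function with values in [0, 1].
    """
    rng = random.Random(seed)
    count = defaultdict(int)     # (bin, arm) -> pulls so far
    total = defaultdict(float)   # (bin, arm) -> summed rewards so far
    history = []
    for t in range(1, n + 1):
        x = tuple(rng.random() for _ in range(d))
        b = tuple(min(int(xi * M), M - 1) for xi in x)

        def index(arm):
            s = count[(b, arm)]
            if s == 0:
                return math.inf  # unexplored arm in this bin goes first
            # Bin average plus the confidence term sqrt(2 log t / s).
            return total[(b, arm)] / s + math.sqrt(2.0 * math.log(t) / s)

        arm = max((1, 2), key=index)
        reward = 1.0 if rng.random() < reward_fns[arm](x) else 0.0
        count[(b, arm)] += 1
        total[(b, arm)] += reward
        history.append((x, arm))
    return history

def f1(x):
    # Hypothetical Lipschitz mean reward: 0.9 near 0, 0.1 near 1.
    return min(0.9, max(0.1, 0.5 + 2.0 * (0.5 - x[0])))

fns = {1: f1, 2: lambda x: 0.5}
hist = ucbogram(fns, n=4000, M=4)
# Agreement with the oracle in the two outer bins over the second half of the run.
late = [(x, a) for x, a in hist[2000:] if x[0] < 0.25 or x[0] >= 0.75]
agree = sum(1 for x, a in late if a == (1 if x[0] < 0.25 else 2)) / len(late)
print(agree)
```

The number of bins `M` is left as a free argument in this sketch; the theorem that follows prescribes how it should scale with the horizon.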

Theorem 3.1 Fix $\beta \in (0, 1]$, $L > 0$ and $\alpha \in (0, 1]$. Let $\mathcal{X} = [0, 1]^d$ and assume that the covariates
$X_t$ have a distribution which is equivalent$^1$ to the Lebesgue measure on the unit hypercube
$\mathcal{X}$. Let the machine satisfy both the smoothness condition with parameters $(\beta, L)$ and the margin
condition with parameter $0 < \alpha \le 1$. Then the UCBogram policy $\hat\pi$ with $M = \lfloor (n/\log n)^{1/(2\beta+d)} \rfloor$
has an expected cumulative regret at time $n$ bounded by

$$R_n(\hat\pi) \le C n \max\Big\{ \Big(\frac{n}{\log n}\Big)^{-\frac{\beta(\alpha+1)}{2\beta+d}},\ \Big(\frac{n}{\log n}\Big)^{-\frac{2\beta}{2\beta+d}} \log n \Big\},$$

where $C > 0$ is a positive constant.
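For a rough numerical feel of the prescription $M = \lfloor (n/\log n)^{1/(2\beta+d)} \rfloor$ and of the polynomial term $n (n/\log n)^{-\beta(\alpha+1)/(2\beta+d)}$, the helper functions below evaluate both. The function names are hypothetical and the unspecified constant $C$ is dropped, so only the order of magnitude is meaningful.

```python
import math

def ucbogram_bins(n, beta, d):
    """M = floor((n / log n)^(1 / (2*beta + d)))."""
    return math.floor((n / math.log(n)) ** (1.0 / (2.0 * beta + d)))

def regret_rate(n, alpha, beta, d):
    """n * (n / log n)^(-beta*(alpha+1)/(2*beta+d)), with the constant C dropped."""
    return n * (n / math.log(n)) ** (-beta * (alpha + 1.0) / (2.0 * beta + d))

for n in (10**3, 10**4, 10**5):
    print(n, ucbogram_bins(n, beta=1.0, d=1), regret_rate(n, alpha=0.5, beta=1.0, d=1))
```

Larger dimension `d` yields fewer bins per coordinate and a rate closer to linear in `n`, which reflects the usual nonparametric curse of dimensionality.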



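Proof. To keep track of positive constants, we number them $c_1, c_2, \ldots$. Define $c_1 = 2Ld^{\beta/2} + 1$,
and let $n_0 \ge 2$ be the largest integer such that

$$\Big(\frac{n_0}{\log n_0}\Big)^{\frac{\beta}{2\beta+d}} \le \frac{2c_1}{\delta_0},$$

where $\delta_0$ is the constant appearing in the margin condition. If $n \le n_0$, we have $R_n \le n_0$ so that
the result of the theorem holds when $C$ is chosen large enough, depending on the constant $n_0$.
In the rest of the proof, we assume that $n > n_0$ so that $c_1 M^{-\beta} < \delta_0$.

Recall that the UCBogram policy $\hat\pi$ is a collection of functions $\hat\pi_t$ that are constant on each
$B_j$, equal to $\hat\pi_t(j)$. Define the regret $R_j(\hat\pi)$ on bin $B_j$ by

$$R_j(\hat\pi) = \sum_{t=1}^n \big( f^{(\pi^\star(X_t))}(X_t) - f^{(\hat\pi_t(j))}(X_t) \big)\, \mathbb{1}(X_t \in B_j),$$

and observe that the overall regret of $\hat\pi$ can be written as

$$R_n(\hat\pi) = \sum_{j=1}^{M^d} \mathbb{E} R_j(\hat\pi).$$

Consider the set of "well behaved" bins on which the expected reward functions of the two arms
are well separated:

$$\mathcal{J} = \big\{ j : \exists\, x \in B_j,\ |f^{(1)}(x) - f^{(2)}(x)| > c_1 M^{-\beta} \big\}.$$

For any $j \notin \mathcal{J}$ and any $x \in B_j$, we have $|f^{(1)}(x) - f^{(2)}(x)| \le c_1 M^{-\beta} < \delta_0$ so that

$$\mathbb{E} R_j(\hat\pi) \le c_1 M^{-\beta} \sum_{t=1}^n \mathbb{P}\big[ 0 < |f^{(1)}(X_t) - f^{(2)}(X_t)| \le c_1 M^{-\beta},\ X_t \in B_j \big].$$

Summing over $j \notin \mathcal{J}$, we obtain from the margin condition that

$$\sum_{j \notin \mathcal{J}} \mathbb{E} R_j(\hat\pi) \le C_\delta c_1^{1+\alpha} n M^{-\beta(1+\alpha)}. \qquad (3)$$

$^1$Two measures $\mu$ and $\nu$ are said to be equivalent if there exist two positive constants $\underline{c}$ and $\bar{c}$ such that
$\underline{c}\,\mu(A) \le \nu(A) \le \bar{c}\,\mu(A)$ for any measurable set $A$.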

We now treat the well behaved bins, i.e., bins $B_j$ such that $j \in \mathcal{J}$. Notice that since each
bin is a hypercube with side length $1/M$ and since the reward functions satisfy the smoothness
condition with parameters $(\beta, L)$, we have

$$|f^{(1)}(x) - f^{(2)}(x)| > c_1 M^{-\beta} - 2Ld^{\beta/2} M^{-\beta} = M^{-\beta},$$

for any $x \in B_j$, $j \in \mathcal{J}$. In particular, for such $j$, since the two functions are continuous, the
difference $f^{(1)}(x) - f^{(2)}(x)$ has constant sign over $B_j$ and $|\bar f_j^{(1)} - \bar f_j^{(2)}| > M^{-\beta}$. As a consequence,
the oracle policy $\pi^\star$ is constant on $B_j$, equal to $\pi^\star(j)$ for any $j \in \mathcal{J}$ and, conditionally on
$\{X_t \in B_j\}$, the game can be viewed as a standard bandit problem, i.e., without covariates, where
arm $i$ has bounded reward with mean $\bar f_j^{(i)}$. Moreover, conditionally on $\{X_t \in B_j\}$, the UCBogram
can be seen as a standard UCB policy. Applying for example Theorem 1 in Auer et al. (2002),
we find that for $j \in \mathcal{J}$,

$$\mathbb{E} R_j(\hat\pi) \le \frac{8 \log n}{\Delta_j} + \Big(1 + \frac{\pi^2}{3}\Big) \Delta_j \le c_2 \frac{\log n}{\Delta_j}, \qquad (4)$$

where $\Delta_j = |\bar f_j^{(1)} - \bar f_j^{(2)}|$ is the average gap in bin $B_j$. We now use the margin condition to provide
lower bounds on $\Delta_j$. Assume without loss of generality that the gaps are ordered
$0 \le \Delta_1 \le \Delta_2 \le \ldots \le \Delta_{M^d}$ and define the integers $j_1, j_2$ such that $\mathcal{J} = \{j_1, \ldots, M^d\}$ and
$j_2 \in \{j_1, \ldots, M^d\}$ is the largest integer such that $\Delta_{j_2} \le \delta_0/c_1$. Therefore, for any
$j \in \{j_1, \ldots, j_2\} \subset \mathcal{J}$, we have on the one hand,

$$P_X\big[ 0 < |f^{(1)} - f^{(2)}| \le \Delta_j + (c_1 - 1) M^{-\beta} \big] \ge \sum_{k=1}^{M^d} p_k\, \mathbb{1}(0 < \Delta_k \le \Delta_j) \ge \frac{\underline{c}\, j}{M^d}, \qquad (5)$$

where we use the fact that $p_k = P_X(B_k) \ge \underline{c}/M^d$ since $P_X$ is equivalent to the Lebesgue measure
on $[0, 1]^d$ (see footnote 1). On the other hand, the margin condition yields for any $j \in \{j_1, \ldots, j_2\}$
that

$$P_X\big[ 0 < |f^{(1)} - f^{(2)}| \le \Delta_j + (c_1 - 1) M^{-\beta} \big] \le C_\delta (c_1 \Delta_j)^\alpha, \qquad (6)$$

where we used the fact that $\Delta_j + (c_1 - 1) M^{-\beta} \le c_1 \Delta_j \le \delta_0$, for any $j \in \{j_1, \ldots, j_2\}$. The
previous two inequalities yield

$$\Delta_j \ge c_3 \Big(\frac{j}{M^d}\Big)^{1/\alpha}, \quad \forall\, j \in \{j_1, \ldots, j_2\}. \qquad (7)$$



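Combining (3), (4) and (7), we obtain the following bound.

$$R_n(\hat\pi) \le c_4 \Big[ n M^{-\beta(1+\alpha)} + j_1 M^{-\beta} + (\log n) \sum_{j=j_1}^{j_2} \Big(\frac{M^d}{j}\Big)^{1/\alpha} + M^d \log n \Big]. \qquad (8)$$

Note that applying the same arguments as in (5) and (6), we find that $j_1$ satisfies

$$\frac{\underline{c}\, j_1}{M^d} \le P_X\big[ 0 < |f^{(1)} - f^{(2)}| \le c_1 M^{-\beta} \big] \le C_\delta (c_1 M^{-\beta})^\alpha,$$

so that $j_1 \le c_5 M^{d - \alpha\beta}$. We now bound from above the sum in (8) using the following integral
approximation:

$$\sum_{j=j_1}^{j_2} \Big(\frac{M^d}{j}\Big)^{1/\alpha} \le \sum_{j=j_1}^{M^d} \Big(\frac{M^d}{j}\Big)^{1/\alpha} \le M^d \int_{M^{-\alpha\beta}}^{1} x^{-1/\alpha}\, \mathrm{d}x. \qquad (9)$$

If $\alpha < 1$, this integral is bounded by $c_6 M^{\beta(1-\alpha)}$ and if $\alpha = 1$, it is bounded by $c_7 \log M$. As a
result, the right-hand side of (9) is of order $M^d (M^{\beta(1-\alpha)} \vee \log M)$ and we obtain from (8) that

$$R_n(\hat\pi) \le c_8 \Big[ n M^{-\beta(1+\alpha)} + M^d \big( M^{\beta(1-\alpha)} \vee \log M \big) \log n \Big], \qquad (10)$$

and the result follows by choosing $M$ as prescribed.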



We should point out that the version of the UCBogram described above specifies the number
of bins $M$ as a function of the horizon $n$, while in practice one does not have foreknowledge of
this value. This limitation can be easily circumvented by using the so-called doubling argument
(Cesa-Bianchi and Lugosi, 2006), which consists of "resetting" the game at times $2^k$, $k = 1, 2, \ldots$
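The doubling argument can be sketched as a schedule of epochs. The generator below, whose name and return format are hypothetical, splits the rounds into blocks that start at the times $2^k$; within each block a fresh copy of the policy would be run as if the horizon were the block length.

```python
def doubling_epochs(total_rounds):
    """Split rounds 1..total_rounds into epochs [2^k, 2^(k+1) - 1], k = 0, 1, ...

    Yields (first_round, last_round, horizon), where horizon = 2^k is the value
    of n that a freshly reset policy would be tuned for during that epoch.
    """
    k = 0
    while 2 ** k <= total_rounds:
        first = 2 ** k
        last = min(2 ** (k + 1) - 1, total_rounds)
        yield first, last, 2 ** k
        k += 1

print(list(doubling_epochs(10)))
```

In the very first epochs the horizon is too small for the prescribed number of bins to be meaningful (for a horizon of 1, $\log n = 0$), so an implementation would need a guard such as using a single bin there; that guard is an assumption of this note rather than part of the argument.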

The reader will note that when $\alpha = 1$ there is an additional $\log n$ factor appearing in the
upper bound given in the statement of the theorem. More generally, for any $\alpha > 1$, it is
possible to minimize the expression on the right hand side of (10) with respect to $M$, but the
optimal value of $M$ would then depend on the value of $\alpha$. This sheds some light on a significant
limitation of the UCBogram which surfaces in this parameter regime: it requires the operator
to pull each arm at least once in each bin and therefore to incur a regret of at least order $M^d$.
In other words, the UCBogram splits the space $\mathcal{X}$ in "too many" bins when $\alpha > 1$. Intuitively
this can be understood as follows. When $\alpha = 1$, the gap function $\Delta(x)$ is bounded away from
zero for most $x \in \mathcal{X}$. For such $x$, there is no need to carefully estimate the gap function since
it has constant sign over "large" contiguous regions. As a result one could use larger bins in
such regions, reducing the overall number of bins and therefore removing the extra logarithmic
term. Of course, such limitations are intrinsic to the UCBogram and may not appear with other
policies, but investigating this is beyond the scope of this paper.



3.3 The inferior sampling rate 

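Unlike in traditional bandit problems, the connection between the inferior sampling rate defined in
(1) and the regret is more intricate here. The following lemma establishes a connection between
the two.

Lemma 3.1 For any $\alpha > 0$, under the margin condition we have

$$S_n(\pi) \le C n^{\frac{1}{1+\alpha}} R_n(\pi)^{\frac{\alpha}{1+\alpha}}$$

for any policy $\pi$ and for some positive constant $C > 0$.

Proof. The idea of the proof is quite standard and originally appeared in Tsybakov (2004). It
has been used in Rigollet and Vert (2009) and Goldenshluger and Zeevi (2009). Define the two
random quantities:

$$r_n(\pi) = \sum_{t=1}^n |f^{(1)}(X_t) - f^{(2)}(X_t)|\, \mathbb{1}(\pi_t(X_t) \ne \pi^\star(X_t)),$$

and

$$s_n(\pi) = \sum_{t=1}^n \mathbb{1}\big( f^{(1)}(X_t) \ne f^{(2)}(X_t),\ \pi_t(X_t) \ne \pi^\star(X_t) \big).$$

We have, for any $\delta > 0$,

$$\begin{aligned}
r_n(\pi) &\ge \delta \sum_{t=1}^n \mathbb{1}(\pi_t(X_t) \ne \pi^\star(X_t))\, \mathbb{1}(|f^{(1)}(X_t) - f^{(2)}(X_t)| > \delta) \\
&\ge \delta \Big[ s_n(\pi) - \sum_{t=1}^n \mathbb{1}\big( \pi_t(X_t) \ne \pi^\star(X_t),\ 0 < |f^{(1)}(X_t) - f^{(2)}(X_t)| \le \delta \big) \Big] \\
&\ge \delta \Big[ s_n(\pi) - \sum_{t=1}^n \mathbb{1}\big( 0 < |f^{(1)}(X_t) - f^{(2)}(X_t)| \le \delta \big) \Big]. \qquad (11)
\end{aligned}$$

Taking expectations on both sides of (11), we obtain that $R_n(\pi) \ge \delta [S_n(\pi) - n C_\delta \delta^\alpha]$, where we
used the margin condition. The proof follows by choosing $\delta = (S_n(\pi)/(cn))^{1/\alpha}$ for $c \ge 2$ large
enough to ensure that $\delta < \delta_0$. ■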
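For completeness, the last step of the proof can be spelled out as follows. This is a sketch under the additional assumption, made here only to obtain explicit constants, that $c \ge 2C_\delta$; the resulting constant is not optimized.

```latex
\begin{align*}
R_n(\pi) &\ge \delta \big[ S_n(\pi) - n C_\delta \delta^\alpha \big],
  \qquad \delta = \Big( \frac{S_n(\pi)}{cn} \Big)^{1/\alpha}, \\
 &= \Big( \frac{S_n(\pi)}{cn} \Big)^{1/\alpha} S_n(\pi) \Big( 1 - \frac{C_\delta}{c} \Big)
  \ \ge\ \frac{1}{2}\, (cn)^{-1/\alpha}\, S_n(\pi)^{\frac{1+\alpha}{\alpha}}, \\
\text{hence}\quad
S_n(\pi) &\le \big( 2\, (cn)^{1/\alpha} R_n(\pi) \big)^{\frac{\alpha}{1+\alpha}}
  = 2^{\frac{\alpha}{1+\alpha}}\, c^{\frac{1}{1+\alpha}}\, n^{\frac{1}{1+\alpha}}\, R_n(\pi)^{\frac{\alpha}{1+\alpha}} .
\end{align*}
```

Since $S_n(\pi) \le n$, this choice gives $\delta \le c^{-1/\alpha}$, which is below $\delta_0$ once $c$ is large enough.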

Using Lemma 3.1, we obtain the following corollary of Theorem 3.1.

Corollary 3.1 Fix $\beta \in (0, 1]$, $L > 0$ and $\alpha \in (0, 1]$. Under the conditions of Theorem 3.1,
the UCBogram policy $\hat\pi$ with $M = \lfloor (n/\log n)^{1/(2\beta+d)} \rfloor$ has an inferior sampling rate at time $n$
bounded by

$$S_n(\hat\pi) \le C n \Big(\frac{n}{\log n}\Big)^{-\frac{\beta\alpha}{2\beta+d}},$$

where $C > 0$ is a positive constant.

4 Lower bound 

While the UCBogram is a very simple policy, it still provides good insights as to how to construct
a lower bound on the regret incurred by any admissible policy. Indeed, the main result of this
section demonstrates that the polynomial rate of the upper bounds in Theorem 3.1 and Corollary 3.1
is optimal in a minimax sense, for a large class of conditional reward distributions. Define
the Kullback-Leibler (KL) divergence between $P$ and $Q$, where $P$ and $Q$ are two probability
distributions, by

$$\mathcal{K}(P, Q) = \begin{cases} \displaystyle\int \log\Big(\frac{\mathrm{d}P}{\mathrm{d}Q}\Big)\, \mathrm{d}P & \text{if } P \ll Q, \\ \infty & \text{otherwise.} \end{cases}$$

Denote by $P^{(i)}_{f^{(i)}(X)}$ the conditional distribution of $Y^{(i)}$ given $X$ for any $i = 1, 2$ and assume that
there exists $\kappa^2 > 0$ such that for any $\theta, \theta'$ in a parameter set $\Theta \subset [0, 1]$ the KL divergence
between $P^{(i)}_\theta$ and $P^{(i)}_{\theta'}$ satisfies

$$\mathcal{K}(P^{(i)}_\theta, P^{(i)}_{\theta'}) \le \frac{1}{\kappa^2} (\theta - \theta')^2. \qquad (12)$$

Assumption (12) is similar to Assumption (B) employed in Tsybakov (2009, Section 2.5) but
does not require absolute continuity with respect to the Lebesgue measure. A direct consequence
of the following lemma is that Assumption (12) is satisfied when $P_\theta$ is a Bernoulli distribution
with parameter $\theta \in (0, 1)$.

Lemma 4.1 For any $a \in [0, 1]$ and $b \in (0, 1)$ let $P_a$ and $P_b$ denote two Bernoulli distributions
with parameters $a$ and $b$ respectively. Then

$$\mathcal{K}(P_a, P_b) \le \frac{(a - b)^2}{b(1 - b)}.$$

In particular, if $\delta_0 \in [0, 1/2)$, Assumption (12) is satisfied with $\kappa^2 = 1/4 - \delta_0^2$, for any
$a \in [0, 1]$, $b \in [1/2 - \delta_0, 1/2 + \delta_0]$.

Proof. From the definition of the KL divergence, we have

$$\mathcal{K}(P_a, P_b) = a \log\Big(\frac{a}{b}\Big) + (1 - a) \log\Big(\frac{1 - a}{1 - b}\Big) \le a \Big(\frac{a - b}{b}\Big) - (1 - a) \Big(\frac{a - b}{1 - b}\Big) = \frac{(a - b)^2}{b(1 - b)},$$

where we used the inequality $\log(1 + u) \le u$.
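The bound of Lemma 4.1 is easy to check numerically. The sketch below, with hypothetical function names, evaluates the Bernoulli KL divergence and the bound $(a - b)^2 / (b(1 - b))$ on a grid and records the largest ratio between the two.

```python
import math

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), for 0 < b < 1."""
    def term(p, q):
        # Convention 0 * log(0 / q) = 0.
        return 0.0 if p == 0.0 else p * math.log(p / q)
    return term(a, b) + term(1.0 - a, 1.0 - b)

def lemma_bound(a, b):
    """Upper bound (a - b)^2 / (b (1 - b)) from Lemma 4.1."""
    return (a - b) ** 2 / (b * (1.0 - b))

grid = [k / 20 for k in range(21)]
worst_ratio = max(
    kl_bernoulli(a, b) / lemma_bound(a, b)
    for a in grid
    for b in grid[1:-1]   # b must stay strictly inside (0, 1)
    if a != b
)
print(worst_ratio)
```

A ratio that stays below one on the grid is consistent with the lemma; it is of course not a substitute for the proof.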



Theorem 4.1 Fix $\alpha, \beta, L > 0$ such that $\alpha\beta \le 1$ and let $\mathcal{X} = [0, 1]^d$. Assume that the covariates
$X_t$ are uniformly distributed on the unit hypercube $\mathcal{X}$ and that there exists $\tau \in (0, 1/2)$ such
that $\{P^{(i)}_\theta, \theta \in [1/2 - \tau, 1/2 + \tau]\}$ satisfies equation (12) for $i = 1, 2$. Then, there exists a
pair of reward functions $f^{(i)}$, $i = 1, 2$ that satisfy both the smoothness condition with parameters
$(\beta, L)$ and the margin condition with parameter $\alpha$, such that for any non-anticipating policy $\pi$
the regret is bounded as follows

$$R_n(\pi) \ge C n^{1 - \frac{\beta(1+\alpha)}{2\beta+d}}, \qquad (13)$$

and the inferior sampling rate is bounded as follows

$$S_n(\pi) \ge C n^{1 - \frac{\beta\alpha}{2\beta+d}}, \qquad (14)$$

for some positive constant $C$.

Proof. To simplify the arguments below, it will be useful to denote arm 2 by $-1$. Finally, with
slight abuse of notation, we use $S_n(\pi, f^{(1)}, f^{(-1)})$ to denote the inferior sampling rate at time $n$
that is defined in (1), making the dependence on the mean reward functions explicit.

In view of Lemma 3.1, it is sufficient to prove (14). To do so we reduce our problem to a
hypothesis testing problem, an approach that is quite standard in the nonparametric literature,
cf. Tsybakov (2009, Chapter 2). For any policy $\pi$, and any $t = 1, \ldots, n$, denote by $\mathbb{P}^t_{\pi,f}$ the
joint distribution of the collection of pairs

$$(X_1, Y_1^{(\pi_1(X_1))}), \ldots, (X_t, Y_t^{(\pi_t(X_t))}),$$

where $\mathbb{E}[Y^{(1)} \mid X] = f(X)$ and $\mathbb{E}[Y^{(-1)} \mid X] = 1/2$. Let $\mathbb{E}^t_{\pi,f}$ denote the corresponding expectation.
It follows that the oracle policy $\pi^\star_f$ is given by $\pi^\star_f(x) = \mathrm{sign}[f(x) - 1/2]$ with the convention that
$\mathrm{sign}(0) = 1$. Fix $\delta_0 \in (0, 1)$ as in the definition of the margin condition. We now construct a
class $\mathcal{C}$ of functions $f : \mathcal{X} \to [0, 1]$ such that $f$ satisfies (2) and

$$P_X\big[ 0 < |f(X) - 1/2| \le \delta \big] \le C_\delta \delta^\alpha, \quad \forall \delta \in [0, \delta_0].$$

As a result, the machine characterized by the expected rewards $f^{(1)} = f$ and $f^{(-1)} = 1/2$ satisfies
both the smoothness and the margin conditions. Moreover, we construct $\mathcal{C}$ in such a way that
for any policy $\pi$

$$\sup_{f \in \mathcal{C}} S_n(\pi, f, 1/2) \ge C n \Big(\frac{n}{\log n}\Big)^{-\frac{\alpha\beta}{2\beta+d}} \qquad (15)$$

for some positive constant $C$. Consider the regular grid $Q = \{q_1, \ldots, q_{M^d}\}$, where $q_k$ denotes
the center of bin $B_k$, $k = 1, \ldots, M^d$, for some $M \ge 1$ to be defined. Define $C_\phi = \min(L, \tau, 1/4)$
and let $\phi_\beta : \mathbb{R}^d \to \mathbb{R}_+$ be a smooth function defined as follows:

$$\phi_\beta(x) = \begin{cases} (1 - \|x\|_\infty)^\beta & \text{if } 0 \le \|x\|_\infty \le 1, \\ 0 & \text{if } \|x\|_\infty > 1. \end{cases}$$

Clearly, we have $|C_\phi \phi_\beta(x) - C_\phi \phi_\beta(x')| \le L \|x - x'\|_\infty^\beta \le L \|x - x'\|^\beta$ for any $x, x' \in \mathbb{R}^d$.

Define the integer $m = \lceil \mu M^{d - \alpha\beta} \rceil$, i.e., the smallest integer that is larger than or equal to
$\mu M^{d - \alpha\beta}$, where $\mu \in (0, 1)$ is chosen small enough to ensure that $m \le M^d$. Define $\Omega_m = \{-1, 1\}^m$
and for any $\omega \in \Omega_m$, define the function $f_\omega$ on $[0, 1]^d$ by

$$f_\omega(x) = 1/2 + \sum_{j=1}^m \omega_j \varphi_j(x),$$

where $\varphi_j(x) = M^{-\beta} C_\phi\, \phi_\beta(M[x - q_j])\, \mathbb{1}(x \in B_j)$. Notice in particular that $f_\omega(x) = 1/2$ if and only
if $x \in \mathcal{X} \setminus \bigcup_{j=1}^m B_j$ up to a set of zero Lebesgue measure. We are now in position to define the
family $\mathcal{C}$ as

$$\mathcal{C} = \{ f_\omega : \omega \in \Omega_m \}.$$
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Note first that any function /a; G C satisfies the smoothness condition (2). We now check that 
the margin condition is satisfied with parameter a. For any to € Qm, we have 

m 

Px{0 < \UX) - 1/2| < C^6) = ^Px{0 < \UX) - 1/2| < C^6,X G B,) 

= mPxiO < (j){M[X - qi]) < 5M^,X G ^i) 
= m 



I(0(Mx) < 5Ml^)dx 

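where in the third equality, we used the fact that $P_X$ denotes the uniform distribution on $[0, 1]^d$.
Now, since $\phi$ is nonnegative and uniformly bounded by 1, we have on the one hand that for
$\delta M^{\beta} > 1$,

$$\int_{[0,1]^d} \mathbf{1}\big(\phi(x) \le \delta M^{\beta}\big)\, dx = 1.$$

On the other hand, when $\delta M^{\beta} \le 1$, we find

$$\int_{[0,1]^d} \mathbf{1}\big(\phi(x) \le \delta M^{\beta}\big)\, dx = 1 - \int_{[0,1]^d} \mathbf{1}\big(\|x\|_\infty \le 1 - M\delta^{1/\beta}\big)\, dx = 1 - \big(1 - M\delta^{1/\beta}\big)^d \le dM\delta^{1/\beta}.$$

It yields

$$\begin{aligned}
P_X\big(0 < |f_\omega(X) - 1/2| \le C_\phi \delta\big) &\le mM^{-d}\, \mathbf{1}(\delta M^{\beta} > 1) + m\, d\, M^{1-d} \delta^{1/\beta}\, \mathbf{1}(\delta M^{\beta} \le 1) \\
&\le (1 + d)\delta^{\alpha},
\end{aligned}$$

where we used the fact that $1 - \alpha\beta \ge 0$ to bound the second term in the last inequality. Thus,
the margin condition is satisfied for any $\delta_0$ and with $C_\delta = (1 + d)/C_\phi^{\alpha}$.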
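
As an informal sanity check of the margin computation, the sketch below builds one function of the form $f_\omega$ and estimates $P_X(0 < |f_\omega(X) - 1/2| \le C_\phi\delta)$ by Monte Carlo, so that it can be compared with $(1+d)\delta^{\alpha}$. Everything specific here is an assumption made for illustration: the helper names `make_f` and `margin_mass`, the lexicographic ordering of the bins, and the parameter values suggested in the usage note. It is not the authors' code.

```python
import random

def phi(u, beta):
    """Bump phi(u) = (1 - ||u||_inf)^beta on the unit sup-norm ball, 0 outside."""
    r = max(abs(c) for c in u)
    return (1.0 - r) ** beta if r <= 1.0 else 0.0

def make_f(omega, M, beta, c_phi):
    """Return f_omega on [0,1]^d, perturbing the first len(omega) bins.

    Assumption (illustration only): bins are indexed lexicographically,
    with M bins per coordinate, and q_j is the center of bin B_j.
    """
    def f(x):
        idx = [min(int(c * M), M - 1) for c in x]        # bin coordinates of x
        j = sum(i * M ** k for k, i in enumerate(idx))   # flat bin index
        if j >= len(omega):
            return 0.5                                   # unperturbed bin
        u = [M * (c - (i + 0.5) / M) for c, i in zip(x, idx)]  # M [x - q_j]
        return 0.5 + omega[j] * M ** (-beta) * c_phi * phi(u, beta)
    return f

def margin_mass(f, d, c_phi, delta, n_samples=20000, seed=0):
    """Monte Carlo estimate of P_X(0 < |f(X) - 1/2| <= c_phi * delta), X uniform."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x = [rng.random() for _ in range(d)]
        gap = abs(f(x) - 0.5)
        if 0.0 < gap <= c_phi * delta:
            hits += 1
    return hits / n_samples
```

For instance, with the hypothetical values $d = 2$, $\alpha = 1$, $\beta = 1/2$, $M = 4$, $\mu = 1/2$ (so $m = 4$) and $C_\phi = 1/4$, one can call `margin_mass(make_f([1, -1, 1, -1], 4, 0.5, 0.25), 2, 0.25, delta)` for several values of `delta` and compare the result with `3 * delta`.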

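We now prove (15) by observing that if we denote $\omega = (\omega_1, \ldots, \omega_m) \in \Omega_m$, we have

$$\begin{aligned}
\sup_{f \in \mathcal{C}} S_n(\pi, f, 1/2) &= \sup_{\omega \in \Omega_m} \sum_{t=1}^{n} \mathbb{E}^{t-1}_{\pi, f_\omega} P_X\big[\pi_t(X_t) \ne \mathrm{sign}(f_\omega(X_t) - 1/2)\big] \\
&= \sup_{\omega \in \Omega_m} \sum_{j=1}^{m} \sum_{t=1}^{n} \mathbb{E}^{t-1}_{\pi, f_\omega} P_X\big[\pi_t(X_t) \ne \omega_j,\ X_t \in B_j\big] \\
&\ge \frac{1}{2^m} \sum_{j=1}^{m} \sum_{t=1}^{n} \sum_{\omega \in \Omega_m} \mathbb{E}^{t-1}_{\pi, f_\omega} P_X\big[\pi_t(X_t) \ne \omega_j,\ X_t \in B_j\big].
\end{aligned} \tag{16}$$

Observe now that for any $j = 1, \ldots, m$, the sum $\sum_{\omega \in \Omega_m} [\cdots]$ in the previous display can be
decomposed as

$$\sum_{\omega_{-j} \in \Omega_{m-1}} \sum_{i \in \{-1, 1\}} \mathbb{E}^{t-1}_{\pi, f_{\omega^i_{-j}}} P_X\big[\pi_t(X_t) \ne i,\ X_t \in B_j\big],$$

where $\omega_{-j} = (\omega_1, \ldots, \omega_{j-1}, \omega_{j+1}, \ldots, \omega_m)$ and $\omega^i_{-j} = (\omega_1, \ldots, \omega_{j-1}, i, \omega_{j+1}, \ldots, \omega_m)$ for
$i = -1, 1$. Using Theorem 2.2(iii) of Tsybakov (2009), and denoting by $P_X^j(\cdot)$ the conditional
distribution $P_X(\cdot \mid X \in B_j)$ and by $\mathcal{K}$ the KL divergence, we get

$$\begin{aligned}
\sum_{i \in \{-1, 1\}} \mathbb{E}^{t-1}_{\pi, f_{\omega^i_{-j}}} P_X\big[\pi_t(X_t) \ne i,\ X_t \in B_j\big]
&= \frac{1}{M^d} \sum_{i \in \{-1, 1\}} \mathbb{E}^{t-1}_{\pi, f_{\omega^i_{-j}}} P_X^j\big[\pi_t(X_t) \ne i\big] \\
&\ge \frac{1}{4M^d} \exp\Big[-\mathcal{K}\big(\mathbb{P}^{t-1}_{\pi, f_{\omega^{-1}_{-j}}} \times P_X^j,\ \mathbb{P}^{t-1}_{\pi, f_{\omega^{1}_{-j}}} \times P_X^j\big)\Big] \\
&= \frac{1}{4M^d} \exp\Big[-\mathcal{K}\big(\mathbb{P}^{t-1}_{\pi, f_{\omega^{-1}_{-j}}},\ \mathbb{P}^{t-1}_{\pi, f_{\omega^{1}_{-j}}}\big)\Big].
\end{aligned} \tag{17}$$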



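For any $t = 2, \ldots, n$, let $\mathcal{F}_t$ denote the $\sigma$-algebra generated by the information available at time
$t$ immediately after observing $X_t$, i.e., $\mathcal{F}_t = \sigma\big(X_t, (X_s, Y_s^{(\pi_s(X_s))}),\ s = 1, \ldots, t-1\big)$. Define the
conditional distribution $\mathbb{P}^{\cdot|\mathcal{F}_t}_{\pi,f}$ of the random couple $(X_t, Y_t^{(\pi_t(X_t))})$, conditioned on $\mathcal{F}_t$. Denote
also by $\mathbb{E}_{X_t}$ the expectation with respect to the marginal distribution of $X_t$. Applying the chain
rule for KL divergence, we find that for any $t = 1, \ldots, n$ and any $f, g : \mathcal{X} \to [0, 1]$, we have

$$\mathcal{K}\big(\mathbb{P}^{t}_{\pi,f}, \mathbb{P}^{t}_{\pi,g}\big) = \mathcal{K}\big(\mathbb{P}^{t-1}_{\pi,f}, \mathbb{P}^{t-1}_{\pi,g}\big) + \mathbb{E}^{t-1}_{\pi,f}\, \mathbb{E}_{X_t}\Big[\mathcal{K}\big(\mathbb{P}^{Y_t|\mathcal{F}_t}_{\pi,f}, \mathbb{P}^{Y_t|\mathcal{F}_t}_{\pi,g}\big)\Big],$$

where $\mathbb{P}^{Y_t|\mathcal{F}_t}_{\pi,f}$ denotes the conditional distribution of $Y_t^{(\pi_t(X_t))}$ given $\mathcal{F}_t$. Since, for any
$f \in \mathcal{C}$, we have that $\mathbb{E}[Y_t^{(\pi_t(X_t))} \mid \mathcal{F}_t] = f^{(\pi_t(X_t))}(X_t) \in [1/2 - \tau, 1/2 + \tau]$, we can apply (12) to
derive the following upper bound:

$$\begin{aligned}
\mathcal{K}\big(\mathbb{P}^{Y_t|\mathcal{F}_t}_{\pi, f_{\omega^{-1}_{-j}}}, \mathbb{P}^{Y_t|\mathcal{F}_t}_{\pi, f_{\omega^{1}_{-j}}}\big)
&\le \frac{4}{\kappa^2} C_\phi^2 M^{-2\beta}\, \mathbf{1}\big(\pi_t(X_t) = 1,\ X_t \in B_j\big) \\
&\le \frac{1}{4\kappa^2 M^{2\beta}}\, \mathbf{1}\big(\pi_t(X_t) = 1,\ X_t \in B_j\big).
\end{aligned}$$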

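By induction, the last two displays yield that for any $t = 1, \ldots, n$,

$$\mathcal{K}\big(\mathbb{P}^{t}_{\pi, f_{\omega^{-1}_{-j}}}, \mathbb{P}^{t}_{\pi, f_{\omega^{1}_{-j}}}\big) \le \frac{1}{4\kappa^2 M^{2\beta}} N_{j,\pi}, \tag{18}$$

where

$$N_{j,\pi} = \mathbb{E}\Big[\sum_{t=1}^{n} \mathbf{1}\big(\pi_t(X_t) = 1,\ X_t \in B_j\big)\Big]$$

denotes the expected number of times $t$ between time 1 and time $n$ that $X_t \in B_j$ and $\pi_t(X_t) = 1$.
Combining (17) and (18), we get

$$\sum_{t=1}^{n} \sum_{\omega \in \Omega_m} \mathbb{E}^{t-1}_{\pi, f_\omega} P_X\big[\pi_t(X_t) \ne \omega_j,\ X_t \in B_j\big] \ge 2^{m-1} \frac{n}{4M^d} \exp\Big(-\frac{N_{j,\pi}}{4\kappa^2 M^{2\beta}}\Big). \tag{19}$$

On the other hand, from the definition of $N_{j,\pi}$, we clearly have

$$\sum_{t=1}^{n} \sum_{\omega \in \Omega_m} \mathbb{E}^{t-1}_{\pi, f_\omega} P_X\big[\pi_t(X_t) \ne \omega_j,\ X_t \in B_j\big] \ge 2^{m-1} N_{j,\pi}. \tag{20}$$

Plugging the lower bounds (19) and (20) into (16) yields

$$\begin{aligned}
\sup_{f \in \mathcal{C}} S_n(\pi, f, 1/2) &\ge \frac{1}{2} \sum_{j=1}^{m} \max\Big\{ \frac{n}{4M^d} \exp\Big(-\frac{N_{j,\pi}}{4\kappa^2 M^{2\beta}}\Big),\ N_{j,\pi} \Big\} \\
&\ge \frac{m}{4} \inf_{z \ge 0} \Big\{ \frac{n}{4M^d} \exp\Big(-\frac{z}{4\kappa^2 M^{2\beta}}\Big) + z \Big\}.
\end{aligned}$$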



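Notice now that

$$z^* = \operatorname*{argmin}_{z \ge 0} \Big\{ \frac{n}{4M^d} \exp\Big(-\frac{z}{4\kappa^2 M^{2\beta}}\Big) + z \Big\}$$

is strictly positive if and only if $n > 16\kappa^2 M^{2\beta + d}$, in which case

$$z^* = 4\kappa^2 M^{2\beta} \log\Big(\frac{n}{16\kappa^2 M^{2\beta + d}}\Big).$$

Taking $M$ of order $n^{1/(2\beta + d)}$, with the constant chosen so that $n/(16\kappa^2 M^{2\beta + d})$ is a fixed constant
larger than 1, gives $z^* = c^* n^{2\beta/(2\beta + d)}$ for some positive constant $c^*$, so that

$$\sup_{f \in \mathcal{C}} S_n(\pi, f, 1/2) \ge C m z^* \ge C n^{1 - \frac{\alpha\beta}{2\beta + d}}.$$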



This completes the proof. 
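
The closed form for the minimizer used at the end of the proof can be checked numerically. The sketch below is illustrative only: it assumes the objective $g(z) = \frac{n}{4M^d}\exp\big(-z/(4\kappa^2 M^{2\beta})\big) + z$ on $z \ge 0$, and the function names `objective` and `z_star` as well as the parameter values in the usage note are hypothetical. Since $g$ is convex, setting $g'(z) = 0$ gives $z^* = b\log(a/b)$ with $a = n/(4M^d)$ and $b = 4\kappa^2 M^{2\beta}$ whenever $a > b$, i.e., whenever $n > 16\kappa^2 M^{2\beta+d}$.

```python
import math

def objective(z, n, M, beta, d, kappa):
    """g(z) = n / (4 M^d) * exp(-z / (4 kappa^2 M^(2 beta))) + z."""
    a = n / (4.0 * M ** d)
    b = 4.0 * kappa ** 2 * M ** (2.0 * beta)
    return a * math.exp(-z / b) + z

def z_star(n, M, beta, d, kappa):
    """Minimizer of g over z >= 0: b * log(a / b) if a > b, and 0 otherwise."""
    a = n / (4.0 * M ** d)
    b = 4.0 * kappa ** 2 * M ** (2.0 * beta)
    # a > b is equivalent to n > 16 kappa^2 M^(2 beta + d).
    return b * math.log(a / b) if a > b else 0.0
```

With the hypothetical values `n = 10**6`, `M = 10`, `beta = 1.0`, `d = 2`, `kappa = 0.5`, the threshold $16\kappa^2 M^{2\beta+d}$ equals 40000, so `z_star` is strictly positive, and evaluating `objective` on a grid of `z` values should never produce a value below `objective(z_star(...), ...)`.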



Notice that the rates obtained in Theorem 4.1 can also be obtained in the full information case,
where the operator observes the whole i.i.d. sequence $(X_i, Y_i^{(1)}, Y_i^{(-1)})$, $i = 1, \ldots, n$, even before
the first round. Indeed, such bounds have been obtained by Audibert and Tsybakov (2007) in
the classification setup, i.e., when the rewards are Bernoulli random variables. However, we
use a different technique, tailored for bandit policies in a partial information setup. While
the final result is the same, we believe that it sheds light on the technicalities encountered in
proving such a lower bound.